Using the SAS System to Study the Gender and Level Measurement Equivalence of a Multi-rater Survey

Author

  • Jim Penny
Abstract

This research used logistic regression to model item responses from a popular 360-for-development survey. The survey contained 57 items on 11 scales. The model used gender and rater group to identify items that exhibited differential item functioning (DIF). The rater groups were self, boss, peer, and direct report. The sample consisted of 752 survey families, where each family was a matched set of four surveys: one self, one boss, one peer, and one direct report. The sample of 3008 surveys contained 76% male and 24% female raters. The procedure to flag items exhibiting differential functioning used an effect size computed from Wald chi-square statistics rather than statistical significance, resulting in fewer flagged items. Three items exhibited rating anomalies due to the gender of the rater or ratee. Twelve items exhibited DIF attributable to rater group. In each instance, the apparent effect of the DIF was small. An examination of the maximum likelihood parameter estimates suggested the rater group DIF was the possible result of hierarchical complexity. The DIF due to gender conformed to expectations of gender-related stereotypical interpretations of item text. This research further suggested that DIF due to environmental complexity could be a naturally occurring phenomenon in some 360-assessments, and that the interpretation of some 360-feedback might need to include the potential for such DIF to exist.

INTRODUCTION

There has been a veritable explosion in the use of 360-assessment, a form of multi-rater assessment for managerial development in organizations. The process of 360-assessment involves providing managers with feedback from four sources: (1) the manager’s boss, (2) the manager’s subordinates or direct reports, (3) the peers or the customers of the target, and (4) the self.
Although the notion of receiving multi-source feedback is not new, at least one premise of multi-rater methodology remains unresolved: Does the 360-process produce a similar measure from different rater groups, just as a measuring tape produces a similar measure with different carpenters? Put another way, does the 360-methodology provide an equivalent measure with each rater group? Alongside the increase in the use of 360-assessment, there remains the question of whether women receive fair assessments of performance in the workplace, or whether a variety of psychological and sociological biases influence the results of 360-methodology when applied to women. That is, does 360-methodology provide an equivalent measure for both men and women, or do the gender biases that sometimes accompany performance appraisals influence the manner in which some raters interpret some items? Moreover, might there exist a potential interaction between rater group and gender of the ratee, or between the gender of the rater and the gender of the ratee?

DIFFERENCES AMONG RATER GROUPS

Discussions of the differences between self and others’ ratings sometimes arise during 360-feedback sessions (Van Velsor & Leslie, 1991), making the existence of measurement equivalence important to interpretations of 360-feedback. It is possible, if not expected, that a feedback recipient will receive low ratings in one area from one rater source while receiving high ratings in that same area from another rater group. A manager, for example, may be interpersonally skilled with bosses yet cold and aloof with direct reports. This manager, therefore, could receive high ratings on interpersonal skills from the boss while receiving low ratings on this dimension from direct reports.
However, the interpretation of the between-group difference rests on the underlying assumption that the raters are responding to their perceptions based on their observations of behaviors exhibited by the manager, and that two raters with similar observations will respond similarly to a given item even though the raters may occupy positions at different levels.

CONTINGENCY THEORY

Contingency theories of leadership (Fiedler, 1978; Fiedler & Chemers, 1982) suggest that disparate ratings can be an indication of an effective manager, and that gaps in the perspectives between groups of raters are often a naturally occurring phenomenon of management. Moreover, Yukl (1981, pp. 99-119) suggested that managers often change their behavior to fit particular situations, and, following this line of argument, managers who behave differently toward different groups of coworkers may receive disparate ratings from members of those groups. Hence, between-group differences may be an acceptable outcome for some managers. Concomitant with the interpretation of group differences in 360-feedback is the expectation that a different interpretation of the item by one group of raters does not contribute substantially to the observed difference, and that the observed difference is only the result of behavioral differences produced by the circumstances of contingency. However, it seems reasonable to anticipate that some items may tap into differences produced by organizational contingency to a greater degree than do other items. The ratings produced by items influenced by contingency, then, become composite scores comprising not only an estimate of the manager’s standing on the trait measured by the survey but also the degree to which contingency influenced the ratings.

COMPLEXITY THEORY

Jaques (1996) and Jaques & Clement (1994) suggested that the degree of environmental complexity and ambiguity seen by a person within an organization generally increases with rank.
That is, a supervisor of the manager is likely to see a more complex and more difficult to comprehend environment than is the direct report of the manager. For example, 360-surveys sometimes contain items that measure the resourcefulness of the manager, and one might argue that the increase in complexity from one level to another could produce different interpretations of what resourcefulness means. Hence, it seems reasonable to anticipate that differences in environment may influence the ratings given on a 360-survey. As with contingency theory, the interpretation of between-group differences in 360-ratings is predicated on the expectation that an observed difference is solely a function of the behavioral differences witnessed by the raters. However, it seems reasonable to anticipate that some items may tap environmental complexity more than other items do. In that event, a rating difference produced by such items may represent a composite of not only the standing of the manager on the trait assessed by the items but also the degree to which complexity influences the rater’s interpretation of that item. One might argue, then, that a manager reviewing 360-feedback could choose to make behavioral changes due not only to the behavioral observations of the raters but also, at least in part, to ratings produced by the anomalous functioning of some items.

DIFFERENCES BETWEEN GENDERS

Many studies over the past twenty-five years have examined the influence of gender on ratings of managerial effectiveness. The results are far from conclusive, which is itself indicative of the complex role gender plays in society. Some studies have demonstrated statistically significant differences attributable to gender of the ratee (Bartol & Butterfield, 1976; Jacobson & Effertz, 1974; Rosen & Jerdee, 1974; Schmitt & Lapin, 1980). Other studies have failed to produce such differences (Pulakos & Wexley, 1983; Thompson & Thompson, 1983).
In circumstances where a woman functions in a role often associated with men, one might expect to find differences attributable to the interaction of “role gender” and gender of the person filling the role. For example, a woman working as a firefighter might find herself at risk of receiving performance reviews that carry not only an assessment of her performance but also the influence of the interaction of her gender with the “gender” of the job. Of course, it stands to reason that men filling roles often associated with women will experience similar bias in performance reviews. Bartol & Butterfield (1976) and Rosen & Jerdee (1974) identified statistically significant interactions between role gender and person gender; however, Jacobson & Effertz (1974) and Mobley (1982) failed to identify such interactions. The influence of gender is likely a composite of many factors, some of which may have small effects until they exist in concert with gender. Moreover, one could argue that raters are more likely to remember the gender of the manager long after forgetting particular exemplars of either good or bad behaviors. Such biases attributable to gender may influence ratings more than other factors and behaviors. For instance, Nieva & Gutek (1980) suggested that level of qualification, level of performance, degree of inference resulting from the ratings, and sex-role incongruence may each explain a portion of rating variability. Other explanatory factors also have arisen in the study of gender differences in managerial ratings. Cash, Gillen, & Burns (1977) suggested that some raters attribute a man’s success to ability while attributing the success of a woman to effort and luck. Greenhaus & Parasuraman (1993) confirmed those findings, though only for women in the highest performance levels. At moderate levels of performance, they found that raters were likely to use ability to explain a woman’s success.
In addition, one might also suggest that stereotypical behaviors and biases may influence performance ratings. Noe (1988) and Powell (1988) gave evidence to suggest that negative stereotypes against minorities and women can have a substantial impact on ratings of performance and effectiveness. Moreover, Martell (1991) found that when there was less time to make an assessment of managerial performance, the performance of men was likely to receive higher ratings than comparable performances by women. Maurer & Taylor (1994) rendered this finding even more poignant when they demonstrated that the perceived masculinity of the ratee could produce higher ratings. Lastly, Powell & Butterfield (1989) suggested that the definition of “good manager” still carried connotations of masculinity despite the growing population of female managers. It seems reasonable to anticipate that some items will tap perceptual differences due to gender to a greater extent than will other items, and one might also suggest that particular items may tap particular gender-related biases and either increment or decrement differentially the resulting 360-ratings. In addition, one may ask if the differential functioning of the item is due to the gender of the rater, the gender of the ratee, the interaction of the two genders, all three, or some other combination. Moreover, do either or both genders interact with the rater group?

RESEARCH QUESTIONS

This research sought to establish the degree to which differential item functioning attributable to rater group and gender may influence the ratings of a 360-survey. That is, will a given manager receive similar ratings from the boss, a direct report, and a peer if those three other raters have had similar experiences with the manager? Are there components in item ratings attributable to the gender of the rater or to the gender of the ratee? Are there items that function differently for particular combinations of rater and ratee gender?
Is there evidence to support the existence of an interaction between the gender of the ratee and the rater group? Moreover, if such items exist, does an explanatory model exist using extant measurement and psychological theory? Finally, if such items exist and if such explanatory models exist, what, then, may be the subsequent implications for the interpretation of the 360-feedback?

METHODOLOGY

This research used logistic regression to detect DIF. Swaminathan & Rogers (1990) first presented this methodology and demonstrated its relationship to the Mantel-Haenszel procedure (Mantel & Haenszel, 1959; Holland & Thayer, 1988). Swaminathan & Rogers (1990) and Clauser & Mazor (1998) have shown that logistic regression is equal in power to the Mantel-Haenszel procedure for the detection of uniform DIF. Moreover, these same authors, along with Penny & Johnson (1999), have shown that the Mantel-Haenszel procedure may lack sufficient statistical power to detect some instances of nonuniform DIF. However, Rogers (1989), Rogers & Swaminathan (1993), and Swaminathan & Rogers (1990) found that logistic regression procedures are likely to have sufficient power to detect nonuniform DIF. Much of the initial research in the use of logistic regression for the detection of DIF involved the examination of dichotomous items; that is, items with two possible responses, usually 0 and 1. However, logistic regression is easy to extend to polytomous data where the respondent chooses one of an ordered set of responses. Samejima (1969, 1979) presented the Graded Response Model, which describes such item responses; these are common on 360-surveys. For example, the Graded Response Model describes a Likert-type item using a 5-point scale of 1=Strongly Disagree to 5=Strongly Agree, positing a response function for each point on the scale according to
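
As an illustration of the logistic regression approach to DIF discussed in this section, the following Python sketch shows the standard dichotomous form of the model, in which the log-odds of endorsing an item depend on a matching variable, a group indicator, and their interaction. A nonzero group coefficient corresponds to uniform DIF, and a nonzero interaction coefficient corresponds to nonuniform DIF. The function names, the coefficient names (b0 to b3), and all numeric values are hypothetical and chosen only for illustration; they are not estimates from this study, and the sketch does not cover the polytomous extension or the effect-size flagging rule used in the research.

```python
import math


def response_probability(theta, group, b0, b1, b2, b3):
    """Probability of endorsing a dichotomous item under a logistic DIF model.

    logit = b0 + b1*theta + b2*group + b3*theta*group
    theta: matching variable (for example, a scale score)
    group: 0 for the reference group, 1 for the focal group
    """
    logit = b0 + b1 * theta + b2 * group + b3 * theta * group
    return 1.0 / (1.0 + math.exp(-logit))


def group_gap(theta, b0, b1, b2, b3):
    """Focal-minus-reference difference in endorsement probability at theta."""
    focal = response_probability(theta, 1, b0, b1, b2, b3)
    reference = response_probability(theta, 0, b0, b1, b2, b3)
    return focal - reference


def wald_chi_square(estimate, std_error):
    """Wald chi-square (1 df) for one coefficient: (estimate / SE) squared."""
    return (estimate / std_error) ** 2


if __name__ == "__main__":
    thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
    # Hypothetical coefficients, not taken from the study.
    scenarios = {
        "uniform DIF (b2 != 0, b3 == 0)": dict(b0=0.0, b1=1.0, b2=0.5, b3=0.0),
        "nonuniform DIF (b2 == 0, b3 != 0)": dict(b0=0.0, b1=1.0, b2=0.0, b3=0.8),
    }
    for label, coefs in scenarios.items():
        gaps = [round(group_gap(t, **coefs), 3) for t in thetas]
        print(label, gaps)
```

In the uniform case the gap between groups keeps the same sign at every level of the matching variable, whereas in the nonuniform case the gap changes sign as the matching variable moves from low to high values, which is the pattern the Mantel-Haenszel procedure may fail to detect.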

Publication date: 2001